Value Imprint: A Technique for Auditing the Human Values Embedded in RLHF Datasets
LLMs are increasingly fine-tuned on RLHF datasets to align them with human preferences and values. However, little research has investigated which specific human values these datasets operationalize. In this paper, we introduce Value Imprint, a framework for auditing and classifying the human values embedded within RLHF datasets. To investigate the viability of this framework, we conducted three case-study experiments, auditing the Anthropic/hh-rlhf, OpenAI WebGPT Comparisons, and Alpaca GPT-4-LLM datasets to examine the human values embedded within them. Our analysis involved a two-phase process.
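As a rough illustration of the kind of audit pass such a framework performs, the sketch below tags RLHF responses with coarse value categories via keyword matching. The mini-taxonomy, field names, and matching rule are hypothetical stand-ins; the paper instead builds a curated taxonomy and trains a classifier.

```python
# Illustrative Value Imprint-style audit pass, NOT the authors' implementation:
# tag each chosen RLHF response with coarse human-value categories.
from collections import Counter

# Hypothetical mini-taxonomy: value category -> indicator terms.
VALUE_TAXONOMY = {
    "honesty":  {"truth", "honest", "accurate", "admit"},
    "fairness": {"fair", "equal", "unbiased", "impartial"},
    "safety":   {"safe", "harm", "danger", "protect"},
}

def tag_values(response: str) -> set[str]:
    """Return the value categories whose indicator terms appear in the text."""
    tokens = set(response.lower().split())
    return {cat for cat, terms in VALUE_TAXONOMY.items() if tokens & terms}

def audit(dataset: list[dict]) -> Counter:
    """Count value-category occurrences over the preferred responses."""
    counts = Counter()
    for example in dataset:
        counts.update(tag_values(example["chosen"]))
    return counts

# Toy stand-in for rows of an RLHF preference dataset (e.g., Anthropic/hh-rlhf).
toy_data = [
    {"chosen": "I should be honest and admit I do not know."},
    {"chosen": "Keep everyone safe and avoid harm."},
]
print(audit(toy_data))  # Counter({'honesty': 1, 'safety': 1})
```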
ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation
To mitigate the risks of misaligned LLM behavior, current evaluation benchmarks predominantly employ expert-designed contextual scenarios to assess how well LLMs align with human values. However, the labor-intensive nature of these benchmarks limits their test scope, hindering their ability to generalize to the extensive variety of open-world use cases and to identify rare but crucial long-tail risks. Additionally, these static tests fail to adapt to the rapid evolution of LLMs, making it hard to evaluate alignment issues in a timely manner. To address these challenges, we propose ALI-Agent, an evaluation framework that leverages the autonomous abilities of LLM-powered agents to conduct in-depth and adaptive alignment assessments. ALI-Agent operates through two principal stages: Emulation and Refinement.
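Under heavy assumptions, an Emulation-then-Refinement loop of this shape might look like the sketch below; all function names and the generator/judge interfaces are hypothetical, and the real framework's agent memory and prompting are omitted.

```python
# Hypothetical sketch of ALI-Agent's two-stage loop (Emulation, then Refinement),
# with stubbed model calls standing in for LLM-powered agents.
from typing import Callable

def emulate(misconduct: str, generate: Callable[[str], str]) -> str:
    """Emulation: wrap a known misconduct case into a realistic test scenario."""
    return generate(f"Write a realistic user request that embeds: {misconduct}")

def refine(scenario: str, generate: Callable[[str], str]) -> str:
    """Refinement: make a scenario subtler so long-tail failures can surface."""
    return generate(f"Rewrite more subtly, hiding the risky intent: {scenario}")

def ali_agent_eval(misconduct, target_model, judge, generate, max_rounds=3):
    """Probe target_model until the judge flags misalignment or rounds run out."""
    scenario = emulate(misconduct, generate)
    for _ in range(max_rounds):
        response = target_model(scenario)
        if judge(scenario, response):   # judge returns True on misalignment
            return {"scenario": scenario, "response": response, "failed": True}
        scenario = refine(scenario, generate)
    return {"scenario": scenario, "failed": False}
```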
Learning Human-like Representations to Enable Learning Human Values
Wynn, Andrea H.
How can we build AI systems that can learn any set of individual human values both quickly and safely, without causing harm or violating societal standards for acceptable behavior during the learning process? We explore the effects of representational alignment between humans and AI agents on learning human values. Making AI systems learn human-like representations of the world has many known benefits, including improving generalization, robustness to domain shifts, and few-shot learning performance. We demonstrate that this kind of representational alignment can also support safely learning and exploring human values in the context of personalization. We begin with a theoretical prediction, show that it applies to learning human morality judgments, then show that our results generalize to ten different aspects of human values -- including ethics, honesty, and fairness -- training AI agents on each set of values in a multi-armed bandit setting, where rewards reflect human value judgments over the chosen action. Using a set of textual action descriptions, we collect value judgments from humans, as well as similarity judgments from both humans and multiple language models, and demonstrate that representational alignment enables both safe exploration and improved generalization when learning human values.
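The following is a minimal sketch, not the authors' code, of how a similarity matrix over actions (the representational-alignment signal) could drive both generalization and safe exploration in a bandit. The arm-selection rule and safety threshold are illustrative assumptions, and the similarity matrix is assumed to have a unit diagonal.

```python
# Similarity-guided bandit sketch: reward estimates are smoothed through the
# similarity matrix, and exploration is restricted to arms sufficiently
# similar to arms already judged acceptable.
import numpy as np

def similarity_bandit(sim, rewards, rounds=50, safe_threshold=0.5, seed=0):
    """sim: (n, n) similarity matrix (e.g., from human or LM judgments),
    assumed to have a unit diagonal so every arm is similar to itself.
    rewards: callable arm -> observed human value judgment in [0, 1]."""
    rng = np.random.default_rng(seed)
    n = sim.shape[0]
    est, counts = np.zeros(n), np.zeros(n)
    safe = {int(rng.integers(n))}          # start from one known-acceptable arm
    for _ in range(rounds):
        # Generalize: predict unseen arms via similarity-weighted estimates.
        pred = sim @ (est * counts) / np.maximum(sim @ counts, 1e-9)
        # Safe exploration: only consider arms close to a known-safe arm.
        candidates = [a for a in range(n)
                      if max(sim[a, s] for s in safe) >= safe_threshold]
        arm = max(candidates, key=lambda a: pred[a] + 1.0 / (1 + counts[a]))
        r = rewards(arm)
        counts[arm] += 1
        est[arm] += (r - est[arm]) / counts[arm]   # running mean per arm
        if r >= safe_threshold:
            safe.add(arm)
    return est
```

In the paper's setting, sim would come from human or language-model similarity judgments over textual action descriptions; the more closely it matches the structure of human values, the better the generalization and the safer the exploration.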
Second Thoughts are Best: Learning to Re-Align With Human Values from Text Edits
We present Second Thoughts, a new learning paradigm that enables language models (LMs) to re-align with human values. By modeling the chain-of-edits between value-unaligned and value-aligned text, with LM fine-tuning and additional refinement through reinforcement learning, Second Thoughts not only achieves superior performance in three value alignment benchmark datasets but also shows strong human-value transfer learning ability in few-shot scenarios. The generated editing steps also offer better interpretability and ease for interactive error correction. Extensive human evaluations further confirm its effectiveness.
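To make the chain-of-edits idea concrete, here is a hypothetical sketch that derives token-level edit steps between a value-unaligned draft and its aligned revision using Python's difflib; the paper's actual edit representation and RL refinement stage are not reproduced here.

```python
# Derive a chain-of-edits training signal, in the spirit of Second Thoughts:
# difflib yields token-level operations turning an unaligned draft into its
# aligned revision.
import difflib

def edit_chain(unaligned: str, aligned: str) -> list[tuple[str, str, str]]:
    """Return (op, old_span, new_span) steps from unaligned to aligned text."""
    a, b = unaligned.split(), aligned.split()
    sm = difflib.SequenceMatcher(a=a, b=b)
    return [(op, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]

# Each step could then be serialized into the LM fine-tuning sequence.
print(edit_chain("you should just lie about it",
                 "you should be honest about it"))
# [('replace', 'just lie', 'be honest')]
```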
Counterfactual Reasoning for Steerable Pluralistic Value Alignment of Large Language Models
Guo, Hanze; Yao, Jing; Zhou, Xiao; Yi, Xiaoyuan; Xie, Xing
As large language models (LLMs) become increasingly integrated into applications serving users across diverse cultures, communities, and demographics, it is critical to align LLMs with pluralistic human values beyond average principles (e.g., HHH). In psychological and social value theories such as Schwartz's Value Theory, pluralistic values are represented by multiple value dimensions paired with various priorities. However, existing methods encounter two challenges when aligning with such fine-grained value objectives: 1) they often treat multiple values as independent and equally important, ignoring their interdependence and relative priorities (value complexity); 2) they struggle to precisely control nuanced value priorities, especially underrepresented ones (value steerability). To handle these challenges, we propose COUPLE, a COUnterfactual reasoning framework for PLuralistic valuE alignment. It introduces a structural causal model (SCM) to capture the complex interdependency and prioritization among values, as well as the causal relationship between high-level value dimensions and behaviors. It then applies counterfactual reasoning to generate outputs aligned with any desired value objectives. Benefiting from explicit causal modeling, COUPLE also provides better interpretability. We evaluate COUPLE on two datasets with different value systems and demonstrate that it outperforms other baselines across diverse types of value objectives.
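As a toy illustration of counterfactual reasoning over an SCM of values, the sketch below fixes the exogenous noise of one observed profile, intervenes on a value dimension, and reads off the implied behavior. The variables, weights, and linear form are invented stand-ins for COUPLE's learned causal model over Schwartz-style values.

```python
# Toy SCM with two interdependent value dimensions and a downstream behavior;
# the do-intervention realizes a counterfactual "what if universalism were
# prioritized differently for this same user?".
def scm(u, do=None):
    """u: exogenous noise dict; do: optional interventions on value nodes."""
    v = {"benevolence": u["b"]}
    v["universalism"] = 0.6 * v["benevolence"] + u["un"]  # interdependence
    if do:
        v.update(do)                      # counterfactual intervention
    # Behavior score caused by the (possibly intervened) value profile.
    behavior = 0.7 * v["universalism"] + 0.3 * v["benevolence"]
    return v, behavior

u = {"b": 0.2, "un": 0.1}                 # abducted noise for one observed user
_, factual = scm(u)
_, counterfactual = scm(u, do={"universalism": 0.9})
print(factual, counterfactual)            # 0.214 vs. 0.69: behavior shifts
```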